perturbed leader
Follow the Perturbed Leader: Optimism and Fast Parallel Algorithms for Smooth Minimax Games
We consider the problem of online learning and its application to solving minimax games. For the online learning problem, Follow the Perturbed Leader (FTPL) is a widely studied algorithm which enjoys the optimal $O(T^{1/2})$ \emph{worst case} regret guarantee for both convex and nonconvex losses. In this work, we show that when the sequence of loss functions is \emph{predictable}, a simple modification of FTPL which incorporates optimism can achieve better regret guarantees, while retaining the optimal worst-case regret guarantee for unpredictable sequences. A key challenge in obtaining these tighter regret bounds is the stochasticity and optimism in the algorithm, which requires different analysis techniques than those commonly used in the analysis of FTPL. The key ingredient we utilize in our analysis is the dual view of perturbation as regularization.
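To make the optimistic update concrete, here is a minimal sketch in Python for the special case of online linear losses over a box, where the argmin is cheap to compute. The callback names and the box domain are illustrative assumptions; the paper states the algorithm for general decision sets via an optimization oracle.

```python
import numpy as np

def optimistic_ftpl(observe_grad, guess_grad, d, eta, T, seed=0):
    """Sketch of optimistic FTPL for online linear losses over the box
    [0, 1]^d. `guess_grad(t)` returns the guess g_t of the upcoming
    gradient; `observe_grad(t, x)` returns the realized gradient nabla_t.
    Both callbacks, and the box domain, are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    cum_grad = np.zeros(d)
    plays = []
    for t in range(T):
        # Fresh exponential perturbation each round, scale ~ 1/eta.
        sigma = rng.exponential(scale=1.0 / eta, size=d)
        # Optimistic sum: past gradients + guess of the next one - noise.
        score = cum_grad + guess_grad(t) - sigma
        # Linear argmin over [0,1]^d: set x_i = 1 wherever score_i < 0.
        x_t = (score < 0).astype(float)
        plays.append(x_t)
        cum_grad += observe_grad(t, x_t)  # observe the true gradient
    return plays
```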
Review for NeurIPS paper: Follow the Perturbed Leader: Optimism and Fast Parallel Algorithms for Smooth Minimax Games
Summary and Contributions: - It is known in the literature that optimistic variants of the FTRL algorithm can yield better bounds when the sequence of loss functions is predictable. Such results are relatively rare for FTPL. This paper proposes an optimistic variant of the FTPL algorithm which matches the known optimal bounds in the worst case, but has the potential to achieve better regret for predictable sequences of loss functions. Specifically, the bounds depend on $\|g_t - \nabla_t\|_*$, where $g_t$ is the estimate of the gradient of the next loss function and $\nabla_t$ is the observed gradient. They instantiate this generic result for the worst-case analysis by setting the future estimate $g_t = 0$, recovering the optimal $O(T^{1/2})$ regret.
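Schematically, the adaptive guarantee the review describes takes the following form (constants and dimension factors elided; the precise statement is in the paper):
$$\mathrm{Regret}_T \;\lesssim\; \eta \sum_{t=1}^{T} \|g_t - \nabla_t\|_*^2 \;+\; \frac{1}{\eta}.$$
With the uninformative guess $g_t = 0$ and bounded gradients, tuning $\eta \asymp T^{-1/2}$ recovers the worst-case $O(T^{1/2})$ regret, while an accurate predictor $g_t \approx \nabla_t$ makes the first term vanish.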
Review for NeurIPS paper: Follow the Perturbed Leader: Optimism and Fast Parallel Algorithms for Smooth Minimax Games
The paper provides a Follow the Perturbed Leader algorithm and analysis that obtain better regret bounds when the loss/gradient sequence is predictable. The proof relies on the equivalent regularization view of FTPL. The authors also provide an application of this result, giving parallelizable algorithms for solving smooth convex-concave saddle-point games. Most of the reviewers found the result interesting. Please address the concerns of the reviewer. Personally, I find the predictable-sequences result interesting.
Replicable Online Learning
Ahmadi, Saba, Bhandari, Siddharth, Blum, Avrim
We investigate the concept of algorithmic replicability introduced by Impagliazzo et al. 2022, Ghazi et al. 2021, Ahn et al. 2024 in an online setting. In our model, the input sequence received by the online learner is generated from time-varying distributions chosen by an adversary (obliviously). Our objective is to design low-regret online algorithms that, with high probability, produce the exact same sequence of actions when run on two independently sampled input sequences generated as described above. We refer to such algorithms as adversarially replicable. Previous works (such as Esfandiari et al. 2022) explored replicability in the online setting under inputs generated independently from a fixed distribution; we term this notion iid-replicability. Our model generalizes to capture both adversarial and iid input sequences, as well as their mixtures, which can be modeled by setting certain distributions as point-masses. We demonstrate adversarially replicable online learning algorithms for online linear optimization and the experts problem that achieve sub-linear regret. Additionally, we propose a general framework for converting an online learner into an adversarially replicable one within our setting, bounding the new regret in terms of the original algorithm's regret. We also present a nearly optimal (in terms of regret) iid-replicable online algorithm for the experts problem, highlighting the distinction between the iid and adversarial notions of replicability. Finally, we establish lower bounds on the regret (in terms of the replicability parameter and time) that any replicable online algorithm must incur.
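The shared-randomness idea behind replicability can be illustrated with a standard randomized-rounding trick. This is a folklore sketch, not the paper's algorithm; the `grid` parameter and the interface are hypothetical.

```python
import numpy as np

def replicable_experts(loss_stream, k, T, grid=0.05, seed=0):
    """Folklore randomized-rounding sketch of replicability. Two runs
    sharing `seed` snap average losses to the same random grid, so small
    sampling differences between independently drawn input sequences
    rarely change the chosen expert."""
    shared = np.random.default_rng(seed)        # shared across both runs
    offset = shared.uniform(0.0, grid, size=k)  # one random grid shift
    cum = np.zeros(k)
    actions = []
    for t in range(1, T + 1):
        avg = cum / max(t - 1, 1)
        # Round each average loss down to the shared shifted grid.
        rounded = np.floor((avg - offset) / grid) * grid + offset
        actions.append(int(np.argmin(rounded)))
        cum += loss_stream(t)  # vector of the k experts' losses
    return actions
```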
Solving Robust MDPs through No-Regret Dynamics
Guha, Etash Kumar, Lee, Jason D.
Reinforcement Learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics. Solving Markov Decision Processes that are robust to such changes is difficult due to nonconvexity and the size of the action and state spaces. While most works have analyzed this problem under differing assumptions, a general and efficient theoretical analysis is still missing. We give a simple framework for improving robustness by solving a minimax iterative optimization problem in which a policy player and an environmental-dynamics player play against each other. Leveraging recent results in online nonconvex learning and techniques for improving policy gradient methods, we obtain an algorithm that maximizes the robustness of the Value Function at a rate of $\mathcal{O}\left(\frac{1}{T^{\frac{1}{2}}}\right)$, where $T$ is the number of iterations of the algorithm.
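A generic instance of the no-regret dynamics this abstract alludes to, for a smooth convex-concave objective, looks as follows. This is an illustrative sketch only; the paper's policy-vs-dynamics game is nonconvex and requires considerably more machinery.

```python
import numpy as np

def no_regret_minimax(grad_x, grad_y, x0, y0, T, eta):
    """Simultaneous gradient descent/ascent with iterate averaging: for
    a smooth convex-concave f, the averaged pair approaches a saddle
    point at rate O(1/sqrt(T))."""
    x, y = x0.copy(), y0.copy()
    xs, ys = [], []
    for _ in range(T):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = x - eta * gx  # policy player descends on f
        y = y + eta * gy  # dynamics player ascends on f
        xs.append(x.copy())
        ys.append(y.copy())
    return np.mean(xs, axis=0), np.mean(ys, axis=0)
```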
Introduction to Multi-Armed Bandits
Multi-armed bandits are a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a review of the more advanced results. The chapters are as follows: Stochastic bandits; Lower bounds; Bayesian Bandits and Thompson Sampling; Lipschitz Bandits; Full Feedback and Adversarial Costs; Adversarial Bandits; Linear Costs and Semi-bandits; Contextual Bandits; Bandits and Zero-Sum Games; Bandits with Knapsacks; Incentivized Exploration and Connections to Mechanism Design.
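As a taste of the book's starting point (stochastic bandits), here is a standard UCB1 sketch; `pull` is a hypothetical environment callback returning rewards in $[0, 1]$.

```python
import numpy as np

def ucb1(pull, k, T):
    """UCB1 for stochastic bandits: pull each arm once, then play the
    arm maximizing the empirical mean plus a confidence bonus."""
    counts = np.zeros(k)
    means = np.zeros(k)
    for a in range(k):  # initialization: one pull per arm
        means[a] = pull(a)
        counts[a] = 1
    for t in range(k + 1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        a = int(np.argmax(means + bonus))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return means, counts
```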
Online Non-Convex Learning: Following the Perturbed Leader is Optimal
Suggala, Arun Sai, Netrapalli, Praneeth
We study the problem of online learning with non-convex losses, where the learner has access to an offline optimization oracle. We show that the classical Follow the Perturbed Leader (FTPL) algorithm achieves optimal regret rate of $O(T^{-1/2})$ in this setting. This improves upon the previous best-known regret rate of $O(T^{-1/3})$ for FTPL. We further show that an optimistic variant of FTPL achieves better regret bounds when the sequence of losses encountered by the learner is `predictable'.
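A minimal sketch of oracle-based FTPL in the spirit of this paper: each round draws a fresh exponential linear perturbation and hands the perturbed cumulative loss to an exact offline minimizer. The `oracle` interface is an assumption stated by the paper, not implemented here.

```python
import numpy as np

def ftpl_nonconvex(oracle, loss_stream, d, eta, T, seed=0):
    """FTPL with an offline optimization oracle. `oracle(fn)` is a
    hypothetical exact minimizer of `fn` over the bounded decision set;
    `loss_stream(t)` returns the round-t loss function f_t."""
    rng = np.random.default_rng(seed)
    past = []   # losses observed so far
    plays = []
    for t in range(T):
        sigma = rng.exponential(scale=1.0 / eta, size=d)
        # Perturbed cumulative loss: sum of past losses minus a random
        # linear term (default args snapshot the current state).
        def perturbed(x, fs=tuple(past), s=sigma):
            return sum(f(x) for f in fs) - s @ x
        x_t = oracle(perturbed)
        plays.append(x_t)
        past.append(loss_stream(t))  # observe f_t after playing
    return plays
```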